Attaching SeuratObject
Example Single Cell RNA-seq Analysis
Import Seurat Object
Data QC and Inspection
After creating the Seurat object, the next step is to do quality control on the data. The most common quality control is to filter out
Cells with too few genes detected. They usually represent cells which are not sequenced deep enough for reliable characterization.
Cells with too many genes detected. They may represent doublets or multiplets (i.e. two or more cells in the same droplet, therefore sharing the same cell barcode).
Cells with high mitochondrial transcript percentage. As most of the scRNA-seq experiments use oligo-T to capture mRNAs, mitochondrial transcripts should be relatively under-representative due to their lack of poly-A tails, but it is unavoidable that some mitochondrial transcripts are captured. Meanwhile, there is also some evidence that stable poly-A tails exist in some mitochondrial transcripts but serve as a marker for degradation (e.g. this paper). Together, cells with high mitochondrial transcript percentage likely represent cells under stress (e.g. hypoxia) which produce a lot of mitochondria, or which produce an abnormally high amount of truncated mitochondrial transcripts.
While numbers of detected genes are summarized by Seurat automatically when creating the Seurat object (with nFeature_RNA being the number of detected genes/features; nCount_RNA being the number of detected transcripts), one needs to calculate mitochondial transcript percentages manually. Still, Seurat provides an easy solution
Please note that there is no one-size-fits-all filtering criteria, as the normal ranges of these metrics can vary dramatically from one experiment to another, depending on sample origin as well as reagents and sequencing depths. One suggestion here is to ONLY FILTER OUT OUTLIER CELLS, i.e. the minority of cells with certain QC metrics clearly above or below the majority of cells. To do that, one needs to first know how these values are distributed in the data. One can look at the distribution by creating a violin plot for each of the metrics.
Or if you don’t like the dots (individual cells)
And as one would expect, number of detected genes and number of detected transcripts are well correlated across cells while mitochondrial transcript percentage is not.
P.S. patchwork is an R package developed to facilitate layout of plots produced by ggplot2 (Seurat uses ggplot2 to produce plots if you use the plotting functions in the Seurat package). Without patchwork, it is illegal to run plot1 + plot2.
Due to the correlation of gene number and transcript number, we only need to set a cutoff to either one of these metrics, combined with an upper threshold of mitochondrial transcript percentage, for the QC. For instance, for this data set, a detected gene number between 500 and 5000, and a mitochondrial transcript percentage lower than 5% would be quite reasonable, but it is fine to use different thresholds.
It is worth to mention that sometimes more QC may need to be applied. One potential issue is the presence of doublets. As the amount of captured RNA varies a lot from cell to cell, doublets don’t always show a higher number of detected genes or transcripts. There are several tools available now, which are designed to predict whether a ‘cell’ is indeed a singlet or actually a doublet/multiplet. DoubletFinder, for instance, predicts doublets by first constructing artificial doublets by randomly averaging cells in the data, and then for each cell testing whether it is more similar to the artificially doublets or not. This helps with the decision whether a cell is likely a doublet or not. Similarly, mitochondrial transcript percentage may not be sufficient to filter out stressed or unhealthy cells. Sometimes one would needs to do extra filtering, e.g. based on the machine learning based prediction.
Normalization and Data scaling
Normalization
Similar to bulk RNA-seq, the amount of captured RNA is different from cell to cell, and one should therefore not directly compare the number of captured transcripts for each gene between cells. A normalization step, aiming to make gene expression levels between different cells comparable, is therefore necessary. The most commonly used normalization in scRNA-seq data analysis is very similar to the concept of TPM (Transcripts Per Million reads) - one normalizes the feature expression measurements for each cell to the total expression, and then multiplies this by a scale factor (10000 by default). At the end, the resulting expression levels are log-transformed so that the expression values better fit a normal distribution. It is worth to mention that before doing the log-transformation, one pseudocount is added to every value so that genes with zero transcripts detected in a cell still present values of zero after log-transform.
Feature selection
The biggest advantage of single-cell as compared to bulk RNA-seq is the potential to look into cellular heterogeneity of samples, by looking for cell groups with distinct molecular signatures. However, not every gene has the same level of information and the same contribution when trying to identify different cell groups. For instance, genes with low expression levels, and those with similar expression levels across all cells, are not very informative and may dilute differences between distinct cell groups. Therefore, it is necessary to perform a proper feature selection before further exploring the scRNA-seq data.
In Seurat, or more general in scRNA-seq data analysis, this step usually refers to the identification of highly variable features/genes, which are genes with the most varied expression levels across cells.
VST function calculates a variance stabilizing transformation (VST) from the fitted dispersion-mean relation(s) and then transforms the count data (normalized by division by the size factors or normalization factors), yielding a matrix of values which are now approximately homoskedastic (having constant variance along the range of mean values). The transformation also normalizes with respect to library size. The rlog is less sensitive to size factors, which can be an issue when size factors vary widely. These transformations are useful when checking for outliers or as input for machine learning techniques such as clustering or linear discriminant analysis.
By default, Seurat calculates the standardized variance of each gene across cells, and picks the top 2000 ones as the highly variable features. One can change the number of highly variable features easily by giving the nfeatures option (here the top 3000 genes are used).
There is no good criteria to determine how many highly variable features to use. Sometimes one needs to go through some iterations to pick the number that gives the most clear and interpretable result. Most often, a value between 2000 to 5000 is OK and using a different value doesn’t affect the results too much.
[1] "CTGF" "CTC-340A15.2" "NPY" "KIF26B" "SRGN"
[6] "CCL3" "CNR1" "RELN" "AJ006998.2" "AC011288.2"
When using repel, set xnudge and ynudge to 0 for optimal results
Warning: Transformation introduced infinite values in continuous x-axis
Warning: Transformation introduced infinite values in continuous x-axis
Data Scaling
Since different genes have different base expression levels and distributions, the contribution of each gene to the analysis is different if no data transformation is performed. This is something we do not want as we don’t want our analysis to only depend on genes that are highly expressed. Therefore a scaling is applied to the data using the selected features, just like one usually does in any data science field.
Centering and scaling data matrix
Data Clustering (PCA/UMAP)
In principle one can start to look at cell heterogeneity after identifying highly variable genes and scaling the data. However, applying a linear dimension reduction before doing any further analysis is strongly recommended and sometimes even seen as compulsory. The benefit of doing such a dimension reduction includes but is not limited to:
The data becomes much more compact so that computation becomes much faster.
As scRNA-seq data is intrinsically sparse, summarizing measurements of related features greatly enhances the signal robustness.
PC_ 1
Positive: SYT1, RASGEF1B, GRIN1, RBFOX1, MT-CO1, MEG3, MT-ND1, MT-CO3, SLC26A3, DLGAP1
SNAP25, STXBP5L, CACNA1B, BCYRN1, SYN2, INO80D, PHYHIP, MT-CO2, CSMD1, DNM1
LINC01609, ENC1, MT-ATP6, SLC17A7, KCNQ5, WBSCR17, IQCJ-SCHIP1, CHD5, MT-ND2, RYR2
Negative: PLP1, SLC1A3, ST18, APOE, CST3, ERBB4, NEAT1, MIR219A2, MT2A, APOD
SLC1A2, PREX2, RNF219-AS1, B2M, ADGRV1, LINC01608, MOBP, SPP1, ATP1A2, LSAMP-AS1
CNDP1, FGFR3, AQP4, NHSL1, PTPRZ1, LINC00499, EMX2, COL5A3, ETNPPL, AC002429.5
PC_ 2
Positive: PTPRD, ST18, APOD, PLP1, LHFPL3, CNTNAP2, MIR219A2, LRRTM4, NXPH1, OPCML
PCDH15, MOBP, LINC01608, GRIK1, SGCZ, LUZP2, RP4-668E10.4, MMP16, ADARB2, SEMA5A
GRIA4, PDE1A, CA10, LINC01170, SMOC1, CNDP1, GRIK2, SPP1, KAZN, TNR
Negative: ADGRV1, RNF219-AS1, GPC5, SLC1A2, TPD52L1, LINC00499, AQP4, SLC4A4, BMPR1B, EMX2
COL5A3, SLC1A3, TRPM3, NKAIN3, GLIS3, ETNPPL, PAMR1, FGFR3, NHSL1, MT2A
STON2, APOE, CLU, AC002429.5, RANBP3L, GNA14, MGST1, GJA1, ATP1A2, SLCO1C1
PC_ 3
Positive: PTPRZ1, NXPH1, PCDH15, SOX6, LUZP2, GRIK1, LHFPL3, VCAN, SGCZ, MMP16
RP4-668E10.4, BRINP3, ERBB4, GRIK2, OPCML, TNR, LINC01322, KCNMB2-AS1, NKAIN3, SEMA5A
KAZN, XKR4, KIAA1211, GPC6, SMOC1, CA10, COL11A1, CSMD1, ADARB2, MIR3681HG
Negative: ST18, PLP1, MOBP, NKAIN2, LINC01608, SPP1, MIR219A2, LPAR6, ADAM28, PDE1A
CNDP1, DOCK8, FYB, APBB1IP, NPTX2, CD74, BDNF, MAML3, SRGN, NPAS4
PTPRC, EGR4, INPP5D, SYK, BLNK, HLA-DRA, OLR1, CX3CR1, ATP8B4, SAMSN1
PC_ 4
Positive: GAD2, ARL4C, GAD1, LINC00486, IGF1, ADARB2, GRIK1, FOS, RP11-123O10.4, DOCK8
ERBB4, DLX6-AS1, BTBD11, KCNIP1, MTRNR2L8, ADAM28, ZNF385D, GRIP1, DLX1, SDK1
SCG2, EGR1, LPAR6, APBB1IP, CD74, SOX1, FOSB, FYB, RP11-154D17.1, LINC01609
Negative: AC011288.2, LDB2, PTPRD, RALYL, ST6GALNAC5, NKAIN2, IQCJ-SCHIP1, RP11-380P13.1, LY86-AS1, FAM19A1
SV2B, AC114765.1, PDE1A, LINC01250, MLIP, KCNIP4, ANO3, CBLN2, CADPS2, OLFM3
FSTL4, SLIT3, CHN1, MKL2, RP11-191L9.4, AC067956.1, AJ006998.2, MIR137HG, HECW1, PHACTR1
PC_ 5
Positive: DOCK8, ADAM28, ST6GAL1, LPAR6, FYB, APBB1IP, CD74, SRGN, SYK, SRGAP2B
C10orf11, INPP5D, PTPRC, RP11-624C23.1, SRGAP2, CSF2RA, HLA-DRA, HS3ST4, P2RY12, BLNK
OLR1, ATP8B4, ARHGAP15, SAMSN1, A2M, CX3CR1, TBXAS1, CSF3R, CD86, RP11-480C22.1
Negative: PLP1, ST18, MIR219A2, MOBP, ARC, LINC01608, FOSB, NPAS4, EGR4, MTRNR2L8
CNDP1, EGR1, ERBB4, JUND, NPTX2, EGR3, PIM3, KIRREL3, LINC01170, DUSP5
INHBA, NKAIN2, RP11-154D17.1, APOD, RP11-81H3.2, SLC27A6, MIDN, KDM6B, BCOR, LBH
PC_ 1
Positive: SYT1, RASGEF1B, GRIN1, RBFOX1, MT-CO1
Negative: PLP1, SLC1A3, ST18, APOE, CST3
PC_ 2
Positive: PTPRD, ST18, APOD, PLP1, LHFPL3
Negative: ADGRV1, RNF219-AS1, GPC5, SLC1A2, TPD52L1
https://www.rdocumentation.org/packages/jackstraw/versions/1.3/topics/jackstraw
Test for association between the observed data and their systematic patterns of variations. Systematic patterns may be captured by latent variables using principal component analysis (PCA), factor analysis (FA), and related methods. The jackstraw enables statistical testing for association between observed variables and latent variables, as captured by PCs or other estimates.
Warning: Removed 21000 rows containing missing values (`geom_point()`).
We chose 10 here, but encourage users to repeat downstream analyses with a different number of PCs (10, 15, or even 50!). As you will observe, the results often do not differ dramatically.
We advise users to err on the higher side when choosing this parameter. For example, performing downstream analyses with only 5 PCs does significantly and adversely affect results.
Computing nearest neighbor graph
Computing SNN
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 8970
Number of edges: 296870
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9382
Number of communities: 15
Elapsed time: 0 seconds
AAACCTGAGTCACGCC_Ct5_Ct6 AAACCTGCATGGATGG_Ct5_Ct6 AAACCTGGTAGCTAAA_Ct5_Ct6
1 0 0
AAACCTGGTGCTAGCC_Ct5_Ct6 AAACCTGGTGTCGCTG_Ct5_Ct6
13 1
Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
A linear dimension reduction has both pros and cons. The good side is that every PC is a linear combination of gene expression so interpretation of PCs are straightforward. Also the data is compressed but not disorted, therefore information in the data is largely remained. The bad side, on the other hand, is that one usually needs more than 10 PCs to cover most of the information. This is fine for most of the analysis, but not for visualization where it is impossible to go over three dimensions for ordinary persons.
The most commonly used non-linear dimension reduction methods in scRNA-seq data analysis are t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP). In this example analysis I only use umap.
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
15:20:20 UMAP embedding parameters a = 0.9922 b = 1.112
15:20:20 Read 8970 rows and found 10 numeric columns
15:20:20 Using Annoy for neighbor search, n_neighbors = 30
15:20:20 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:20:21 Writing NN index file to temp file /var/folders/xn/zm5z4n5j6t5fbn1m6pzwtqsc0000gn/T//RtmpZPvAHR/filec3e5bfd4256
15:20:21 Searching Annoy index using 1 thread, search_k = 3000
15:20:23 Annoy recall = 100%
15:20:23 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
15:20:23 Initializing from normalized Laplacian + noise (using irlba)
15:20:24 Commencing optimization for 500 epochs, with 369150 positive edges
15:20:37 Optimization finished
Markers Identification
For a more efficient implementation of the Wilcoxon Rank Sum Test,
(default method for FindMarkers) please install the limma package
--------------------------------------------
install.packages('BiocManager')
BiocManager::install('limma')
--------------------------------------------
After installation of limma, Seurat will automatically use the more
efficient implementation (no further action necessary).
This message will be shown once per session
p_val avg_log2FC pct.1 pct.2 p_val_adj
PLPP3 0 1.65 0.512 0.098 0
NFIA 0 1.50 0.874 0.360 0
CACHD1 0 1.92 0.556 0.106 0
C1orf61 0 2.00 0.805 0.227 0
ATP1A2 0 2.27 0.686 0.086 0
p_val avg_log2FC pct.1 pct.2 p_val_adj
STXBP5L 0 1.96 0.886 0.371 0
KCNIP4 0 1.87 0.969 0.511 0
KCNQ5 0 1.99 0.761 0.263 0
SYT1 0 1.79 0.957 0.398 0
MEG3 0 1.67 0.992 0.605 0
p_val avg_log2FC pct.1 pct.2 p_val_adj
CHRM3 0 3.02 0.765 0.051 0
INO80D 0 3.83 0.663 0.058 0
SYNPR 0 3.60 0.765 0.028 0
ROBO2 0 4.23 0.925 0.093 0
RASGEF1B 0 5.71 0.683 0.036 0
Calculating cluster 0
Calculating cluster 1
Calculating cluster 2
Calculating cluster 3
Calculating cluster 4
Calculating cluster 5
Calculating cluster 6
Calculating cluster 7
Calculating cluster 8
Calculating cluster 9
Calculating cluster 10
Calculating cluster 11
Calculating cluster 12
Calculating cluster 13
Calculating cluster 14
Warning in FetchData.Seurat(object = object, vars = c(dims, "ident", features),
: The following requested variables were not found: MS4A1, FCER1A, PPBP
Warning: CombinePlots is being deprecated. Plots should now be combined using
the patchwork system.
# A tibble: 150 × 7
# Groups: cluster [15]
p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
<dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
1 0 4.31 0.991 0.219 0 0 PLP1
2 0 3.98 0.911 0.107 0 0 ST18
3 0 3.72 0.963 0.285 0 0 CTNNA3
4 0 3.50 0.962 0.271 0 0 MBP
5 0 3.49 0.853 0.108 0 0 TMEM144
6 0 3.25 0.609 0.045 0 0 LINC01608
7 0 3.19 0.748 0.113 0 0 MIR219A2
8 0 3.13 0.633 0.059 0 0 MOBP
9 0 3.09 0.681 0.092 0 0 TF
10 0 3.08 0.696 0.1 0 0 CLDN11
# … with 140 more rows
Warning: CombinePlots is being deprecated. Plots should now be combined using
the patchwork system.
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.0
[5] purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.2.0
[9] ggplot2_3.4.1 tidyverse_2.0.0 patchwork_1.1.2 SeuratObject_4.1.3
[13] Seurat_4.3.0 conflicted_1.2.0
loaded via a namespace (and not attached):
[1] Rtsne_0.16 colorspace_2.1-0 deldir_1.0-6
[4] ellipsis_0.3.2 ggridges_0.5.4 rstudioapi_0.14
[7] spatstat.data_3.0-1 farver_2.1.1 leiden_0.4.3
[10] listenv_0.9.0 ggrepel_0.9.3 fansi_1.0.4
[13] codetools_0.2-19 splines_4.2.1 cachem_1.0.7
[16] knitr_1.42 polyclip_1.10-4 jsonlite_1.8.4
[19] ica_1.0-3 cluster_2.1.4 png_0.1-8
[22] uwot_0.1.14 shiny_1.7.4 sctransform_0.3.5
[25] spatstat.sparse_3.0-1 compiler_4.2.1 httr_1.4.5
[28] Matrix_1.5-3 fastmap_1.1.1 lazyeval_0.2.2
[31] cli_3.6.0 later_1.3.0 htmltools_0.5.4
[34] tools_4.2.1 igraph_1.4.1 gtable_0.3.2
[37] glue_1.6.2 RANN_2.6.1 reshape2_1.4.4
[40] Rcpp_1.0.10 scattermore_0.8 vctrs_0.5.2
[43] nlme_3.1-162 spatstat.explore_3.1-0 progressr_0.13.0
[46] lmtest_0.9-40 spatstat.random_3.1-4 xfun_0.37
[49] globals_0.16.2 timechange_0.2.0 mime_0.12
[52] miniUI_0.1.1.1 lifecycle_1.0.3 irlba_2.3.5.1
[55] goftest_1.2-3 future_1.32.0 MASS_7.3-58.3
[58] zoo_1.8-11 scales_1.2.1 hms_1.1.2
[61] promises_1.2.0.1 spatstat.utils_3.0-2 parallel_4.2.1
[64] RColorBrewer_1.1-3 yaml_2.3.7 memoise_2.0.1
[67] reticulate_1.28 pbapply_1.7-0 gridExtra_2.3
[70] stringi_1.7.12 rlang_1.1.0 pkgconfig_2.0.3
[73] matrixStats_0.63.0 evaluate_0.20 lattice_0.20-45
[76] ROCR_1.0-11 tensor_1.5 labeling_0.4.2
[79] htmlwidgets_1.6.2 cowplot_1.1.1 tidyselect_1.2.0
[82] parallelly_1.34.0 RcppAnnoy_0.0.20 plyr_1.8.8
[85] magrittr_2.0.3 R6_2.5.1 generics_0.1.3
[88] DBI_1.1.3 withr_2.5.0 pillar_1.8.1
[91] fitdistrplus_1.1-8 survival_3.5-5 abind_1.4-5
[94] sp_1.6-0 future.apply_1.10.0 KernSmooth_2.23-20
[97] utf8_1.2.3 spatstat.geom_3.1-0 plotly_4.10.1
[100] tzdb_0.3.0 rmarkdown_2.20 grid_4.2.1
[103] data.table_1.14.8 digest_0.6.31 xtable_1.8-4
[106] httpuv_1.6.9 munsell_0.5.0 viridisLite_0.4.1